Abstract The Kolmogorov-Arnold stochasticity parameter technique is
applied for the first time to the study of cancer genome sequencing, to reveal
mutations. Using data generated by next-generation sequencing technologies,
we have analyzed the exome sequences of brain tumor patients with matched
tumor and normal blood. We show that mutations contained in sequencing
data can be revealed using this technique, thus providing a new methodology
for determining subsequences of given length that contain mutations, i.e.
subsequences for which the value of the parameter differs from that of
subsequences without mutations. A potential application of this technique
involves simplifying the procedure of finding segments with mutations,
speeding up genomic research, and accelerating its implementation in
clinical diagnostics. Moreover, the prediction of a mutation associated with
a family of frequent mutations in numerous types of cancers, based purely
on the value of the Kolmogorov function, indicates that this marker may
recognize genomic sequences that are in extremely low abundance and can be
used in revealing new types of mutations.
1 Introduction


To study mutations in the genomic sequences of cancerous tissues we use
the statistic introduced initially by Kolmogorov and later developed by
Arnold to define a degree of randomness (stochasticity) for a given
sequence of real numbers. The universality of the method has been revealed
in measuring the degree of randomness of finite sequences in the theory of
dynamical systems and in number theory. This approach has been applied
to physical problems, namely to the study of non-Gaussianities in the cosmic
microwave background radiation [10] and of X-ray galaxy clusters [11]. This
method was instrumental in detecting the thermal thrust effect (Yarkovsky-
Rubincam effect) for the first time in the properties of satellites probing
General Relativity [12].


2 Method 

Let us briefly introduce the technique and the descriptors which were then
applied to the genomic data. For $n$ independent real-valued random
variables $\{X_1, X_2, \dots, X_n\}$ ordered in increasing manner,
$X_1 < X_2 < \dots < X_n$, the cumulative distribution function (CDF) is
defined as $F(x) = P\{X \le x\}$ [6].
The empirical distribution function $F_n(x)$ is

$$
F_n(x) =
\begin{cases}
0, & x < X_1, \\
k/n, & X_k \le x < X_{k+1}, \quad k = 1, 2, \dots, n-1, \\
1, & X_n \le x.
\end{cases}
$$
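The definition of the empirical distribution function can be illustrated with a short Python sketch. This is only an illustration under our own naming (the function `empirical_cdf` is hypothetical and is not part of the software used in the analysis); it counts the sample points not exceeding $x$ and divides by $n$.

```python
from bisect import bisect_right


def empirical_cdf(sample):
    """Build the empirical distribution function F_n of a finite sample."""
    xs = sorted(sample)
    n = len(xs)
    if n == 0:
        raise ValueError("sample must be non-empty")

    def F_n(x):
        # bisect_right gives the number of sample points <= x,
        # so F_n is 0 below X_1, k/n on [X_k, X_{k+1}) and 1 from X_n on.
        return bisect_right(xs, x) / n

    return F_n


F = empirical_cdf([0.2, 0.5, 0.9, 0.4])
print(F(0.1), F(0.45), F(0.9))
```

The sorted copy of the sample is kept inside the closure, so each evaluation of `F_n` costs a single binary search.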



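Then the stochasticity parameter is defined as

$$\lambda_n = \sqrt{n}\, \sup_x |F_n(x) - F(x)| . \qquad (1)$$

Kolmogorov's theorem states that for any continuous CDF $F$ the following
limit converges uniformly:

$$\lim_{n \to \infty} P\{\lambda_n \le \lambda\} = \Phi(\lambda) , \qquad (2)$$

where $\Phi(0) = 0$,

$$\Phi(\lambda) = \sum_{k=-\infty}^{+\infty} (-1)^k e^{-2 k^2 \lambda^2} , \quad \lambda > 0 , \qquad (3)$$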
and the distribution $\Phi$ (Kolmogorov's distribution) is independent of $F$.
For small values of $\lambda$ the following approximation holds:

$$\Phi(\lambda) \approx \frac{\sqrt{2\pi}}{\lambda}\, e^{-\pi^2 / (8 \lambda^2)} . \qquad (4)$$

This method thus provides a measure of the degree of randomness
(stochasticity) for sequences of $n$ values, with $\lambda_n$ within the
interval [0.3, 2.4].
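The quantities entering equations (1)-(4) can be evaluated numerically along the following lines. This is a minimal sketch under stated assumptions, not the Pascal code used for the analysis: the function names are hypothetical, the infinite sums are truncated at a fixed number of terms, and for $\lambda < 1$ the code uses the mathematically equivalent series $\frac{\sqrt{2\pi}}{\lambda} \sum_{k \ge 1} e^{-(2k-1)^2 \pi^2 / (8\lambda^2)}$, whose leading term is the small-$\lambda$ approximation.

```python
import math


def stochasticity_parameter(sample, cdf):
    """lambda_n = sqrt(n) * sup_x |F_n(x) - F(x)| for a theoretical CDF `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    if n == 0:
        raise ValueError("sample must be non-empty")
    d = 0.0
    for k, x in enumerate(xs, start=1):
        f = cdf(x)
        # F_n jumps from (k-1)/n to k/n at x, so both sides are checked.
        d = max(d, abs(k / n - f), abs(f - (k - 1) / n))
    return math.sqrt(n) * d


def kolmogorov_phi(lam, terms=100):
    """Kolmogorov's function Phi(lambda), with the sums truncated at `terms`."""
    if lam <= 0:
        return 0.0
    if lam < 1.0:
        # Equivalent series that converges quickly for small lambda.
        c = math.pi ** 2 / (8.0 * lam * lam)
        s = sum(math.exp(-((2 * k - 1) ** 2) * c) for k in range(1, terms + 1))
        return math.sqrt(2.0 * math.pi) / lam * s
    # Alternating series; the k and -k terms are equal, hence the factor 2.
    s = sum((-1) ** k * math.exp(-2.0 * k * k * lam * lam)
            for k in range(1, terms + 1))
    return 1.0 + 2.0 * s


def phi_small_lambda(lam):
    """Leading-term approximation of Phi for small lambda."""
    return math.sqrt(2.0 * math.pi) / lam * math.exp(-math.pi ** 2 / (8.0 * lam * lam))


# Toy usage: two points tested against the uniform CDF on [0, 1].
lam = stochasticity_parameter([0.25, 0.75], lambda x: x)
print(lam, kolmogorov_phi(lam), phi_small_lambda(lam))
```

With these definitions $\Phi(0.3)$ is of order $10^{-5}$ and $\Phi(2.4)$ differs from 1 by about $2 \times 10^{-5}$, which illustrates why the interval [0.3, 2.4] covers practically the whole range of $\Phi$.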


3 Data 


The following data have been used for the analysis. Gliomas are the most
frequent malignant tumors of the CNS and are defined by WHO grade I to
grade IV classification standards and histopathological features [14, 15].
Application of novel techniques to elucidate the fundamental genetic mutations
in grade II-III astrocytomas and grade IV glioblastomas is a critical next
step in glioma research. Here, we interrogated exome data of 30 brain tumor
patients from the Preston Robert Tisch Brain Tumor Center at Duke
University, as described previously. The exomes of 30 patients were selected
as they provided a large enough dataset to conduct our initial analysis. Each
case contained four datasets corresponding to aligned paired-end sequencing
data files (both a forward and a reverse file for each patient's tumor and
normal blood). Included in this study were 15 grade III astrocytomas and 15
grade IV glioblastomas. Samples yielded on average 36 and 32 somatic
mutations for grade III astrocytomas and grade IV glioblastomas, respectively.
Of particular interest to the general cancer community is the prevalence of
highly recurrent mutations across all types of cancer. To this end, a list of
highly recurrent mutations occurring in 23 genes commonly seen in cancer
was used to interrogate the dataset (Table 1). The selected genes are from
a list of frequently mutated genes in cancer provided by Personal Genome
Diagnostics (Baltimore, MD). Next, we surveyed the exome data to identify
any occurrence of these 407 highly recurrent mutations located within these
23 commonly mutated genes, see [16].

Gene Symbol  Gene Description                                               Transcript Accession
ABL1         c-abl oncogene 1; non-receptor tyrosine kinase                 X16416
AKT1         v-akt murine thymoma viral oncogene homolog 1                  ENST00000349310
AKT2         v-akt murine thymoma viral oncogene homolog 2                  NM_001626.3
ALK          anaplastic lymphoma receptor tyrosine kinase                   NM_004304
BRAF         v-raf murine sarcoma viral oncogene homolog B1                 NM_004333
CDK4         cyclin-dependent kinase 4                                      NM_000075.2
EGFR         epidermal growth factor receptor                               NM_005228
ERBB2        v-erb-b2 erythroblastic leukemia viral oncogene homolog 2      NM_004448
FGFR1        fibroblast growth factor receptor 1                            NM_023110
FGFR3        fibroblast growth factor receptor 3                            NM_000142
FLT3         fms-related tyrosine kinase 3                                  NM_004119
HRAS         v-Ha-ras Harvey rat sarcoma viral oncogene homolog             NM_005343
IDH1         isocitrate dehydrogenase 1 (NADP+); soluble                    NM_005896.2
IDH2         isocitrate dehydrogenase 2 (NADP+); mitochondrial              NM_002168.2
JAK2         Janus kinase 2                                                 ENST00000381652
KIT          v-kit Hardy-Zuckerman 4 feline sarcoma viral oncogene homolog  NM_000222
KRAS         v-Ki-ras2 Kirsten rat sarcoma viral oncogene homolog           NM_004985
MET          met proto-oncogene (hepatocyte growth factor receptor)         NM_000245
MET          met proto-oncogene (hepatocyte growth factor receptor)         NM_001127500
NRAS         neuroblastoma RAS viral (v-ras) oncogene homolog               NM_002524
PDGFRa       platelet-derived growth factor receptor; alpha polypeptide     NM_006206
PIK3CA       phosphoinositide-3-kinase; catalytic; alpha polypeptide        NM_006218.1
RET          ret proto-oncogene                                             NM_020975

Table 1: 23 genes commonly mutated in cancer.

4 Analysis

The data for 30 patients have been analysed in the following manner. First,
for each dataset the cumulative distribution function has been obtained
for 3-combinations of nucleotides (guanine, adenine, thymine, and cytosine;
G, A, T, C), i.e. of codons. For comparison, in one- and two-base analyses of
the same data no CDF could be defined due to the large scatter in the
frequency counts; it was possible, however, for the 3-base empirical
distribution of each particular genomic sequence, which was followed by
obtaining the stochasticity parameter and then Kolmogorov's function $\Phi$.
The fact that $\Phi$ could be defined only for the 3-base CDF can be
considered a genuine feature linked to the nature of the genetic coding (first
noticed by the cosmologist George Gamow in the 1950s), determined by the
chemical properties of the molecules forming the nucleotides.
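The first step, obtaining frequency counts and a cumulative distribution for 3-base words, might be sketched as follows. The text does not specify the reading frame, whether the words overlap, or how the 64 words are ordered, so everything below is an assumption made for illustration: words are taken without overlap from the start of each row, words containing an unidentified base are skipped, and the words are ordered lexicographically over A, C, G, T. The function names are hypothetical.

```python
from collections import Counter
from itertools import accumulate, product

BASES = "ACGT"
# All 64 three-base words in lexicographic order (an assumed ordering).
CODONS = ["".join(p) for p in product(BASES, repeat=3)]


def count_codons(rows):
    """Count non-overlapping 3-base words, skipping words with unknown bases."""
    counts = Counter()
    for row in rows:
        for i in range(0, len(row) - 2, 3):
            word = row[i:i + 3]
            if all(base in BASES for base in word):
                counts[word] += 1
    return counts


def codon_cdf(counts):
    """Cumulative relative frequency of the words, in the order of CODONS."""
    total = sum(counts[w] for w in CODONS)
    if total == 0:
        raise ValueError("no valid 3-base words found")
    cumulative = accumulate(counts[w] for w in CODONS)
    return {w: c / total for w, c in zip(CODONS, cumulative)}


rows = ["AAACCCNGT", "AAATTT"]  # toy rows, not real data
cdf = codon_cdf(count_codons(rows))
print(cdf["AAA"], cdf["CCC"], cdf["TTT"])
```

A different choice of ordering or reading frame changes the shape of the resulting distribution, so these choices would have to be fixed once and used consistently for all datasets being compared.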

The exome data described above were represented in a format of over 50
mln rows of 100 bases each, i.e. in total sequences of about $10^9$ values. We
analysed such datasets for 30 patients; a block of 4 datasets, representing
paired-end sequencing, was available for each patient, including those for
blood and tumor, denoted as normal and with tumor, respectively. The
following question was addressed: is the K-A technique able to distinguish
the strings, i.e. sequence pieces of given length, with and without
mutations, for a given sample of mutations? Figure 1 represents the results
of computations for 100-base rows containing such mutations in all 119
files (dark column on the right; 1 file was corrupted), and the mean of the
function $\Phi$ for 50 rows without mutations (light-colored column on the
left) in the same sequence where the former mutations have been located. The
same procedure has been repeated for shorter strings, i.e. for 50- and
25-base strings with and without mutations (the right two double-columns,
respectively). The shorter, 50-base strings were generated in the following
manner: if the mutation is located completely in either the first or the
second half of the 100-base string, then the corresponding half was included
in the analysis. In the case of partial location, the proper number of bases
was included from either side; the mutations are included completely even
when they pass partly to the next row. The 25-base strings were treated
similarly, while for no-mutation strings (rows) their initial 50- or 25-base
parts were included in the analysis. For no-mutation rows their alignment,
i.e. their position in the sequence, was not important. The error bars
correspond to standard errors. The analysis was performed by means of
software created in Pascal in the Delphi environment, intended to be made
public in due course. The CPU time for one sequence (about $10^9$
nucleotides) is about 1 hour for an i7-2600 3.4 GHz processor with 6 GB of
memory.
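The selection of the shorter strings described above can be sketched in Python as follows. This is a hedged illustration rather than the actual implementation: the function names are hypothetical, the mutation is assumed to lie within a single row (the case of a mutation passing into the next row is not handled), and when the mutation straddles two aligned pieces the window is simply centred on it, which is one possible reading of "the proper number of bases was included from either side".

```python
def sub_window(row, mut_start, mut_len, width):
    """Return a `width`-base piece of `row` that fully contains the mutation."""
    if mut_len < 1 or mut_len > width or width > len(row):
        raise ValueError("window and mutation sizes are inconsistent")
    mut_end = mut_start + mut_len  # exclusive end of the mutation span
    if mut_start < 0 or mut_end > len(row):
        raise ValueError("mutation span must lie inside the row")
    first_block = mut_start // width
    last_block = (mut_end - 1) // width
    if first_block == last_block:
        # Mutation lies completely inside one aligned piece: take that piece.
        start = first_block * width
    else:
        # Mutation straddles two pieces: centre the window on it (assumption).
        start = mut_start - (width - mut_len) // 2
    start = max(0, min(start, len(row) - width))
    return row[start:start + width]


def control_window(row, width):
    """For rows without mutations the initial `width` bases are used."""
    return row[:width]


row = "A" * 45 + "CCCCCNGGGGG" + "T" * 44  # toy 100-base row
print(sub_window(row, 45, 11, 50))
print(control_window(row, 25))
```

The final clamp only matters when the row length is not a multiple of the window width; for 100-base rows and widths of 50 or 25 it leaves the window unchanged.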



[Figure 1: bar chart with panels Word 100, Word 50 and Word 25; legend: No MC, With MC]

Figure 1: The function $\Phi$ averaged for rows with mutations (dark bars) and
for normal ones (light bars), averaged over 50 rows, for 100-, 50- and
25-base rows (here denoted as word), respectively. Error bars correspond to
standard errors.
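For reference, the mean and the standard error used for bars of this kind can be computed as in the following sketch. The function name is hypothetical, and the use of the sample standard deviation (with $n - 1$ in the denominator) is an assumption, since the text does not state which estimator was used.

```python
import math
import statistics


def mean_and_standard_error(values):
    """Return (mean, standard error of the mean) of a list of Phi values."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = statistics.mean(values)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, sem


print(mean_and_standard_error([0.41, 0.39, 0.44, 0.40]))  # toy values
```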


Then, rows containing the most frequent specific mutations in the same
dataset of 30 patients (Table 2) have been analysed. Obviously, more
frequent mutations provide higher statistics, and Table 2 and Figure 2 are
presented to show the scales of the input frequency numbers vs the results.
The mutations, i.e. the genes, the mutant positions and the amino acid
changes, are known for the codes listed in Table 2, and from the individual
mutation reports of the performed studies one can list all mutations
contained within each tumor; we intend to address these issues in
forthcoming publications on the applications of the method discussed here.
The results for the mean $\Phi$ with standard error bars are presented in
Figure 2. One can see that certain specific mutations can be distinguished
by the value of the mean $\Phi$.

Table 2: The codes (MC) and frequency counts for the 7 most highly recurrent
mutations in the studied database.

N  MC           Frequency
1  AAAGANAAATT  3032
2  CATCTNAAAAA  2569
3  AGAAGNAAAAA  2427
4  ATTTTNTCTTT  2409
5  TGCCCNGGCTG  2253
6  CTTTCNTTTCT  2200
7  TCTTCNTTTTT  2123


5 One step further: possibility for discovering new mutations

Up to now we were estimating the Kolmogorov function $\Phi$ for genomic
sequence pieces (rows) with known mutations and without those mutations,
and we revealed differences in the mean values of $\Phi$ for rows with and
without those mutations. Given this, one can pose the inverse problem: can
one detect unknown mutations based on the value of $\Phi$, if estimated
blindly in a given genome sequence? In Table 3 we show the results for a
sample from the above studied database, from a genome with tumor (081T1):
11 rows (of over 50 mln) with $\Phi > 0.7$ have been revealed (only the rows
with the number of unidentified nucleotides N < 3 have been taken into
account), with the codes given in Table 3 as candidates for mutations.
Obviously, a certain part of these cases can be just noise, i.e. without any
association with real mutations; however, if at least part of this list, when
studied by conventional methods, is confirmed as associated with mutations,
then one will have an explicit tool for the detection of unknown mutations by
means of this relatively simple method, i.e. one consuming little time and
manpower. Certainly, an exhaustive answer to this question will need
comprehensive parallel studies with different analysis methods.

[Figure 2: bar chart titled "W100, Phi, mean 50 [119 files] + StdError"; legend: No MC, With MC]

Figure 2: The same as in Figure 1, but with the averaged values of the
function $\Phi$ for 100-base rows for the most highly recurrent specific
mutations in the studied dataset, as listed in Table 2.

For now, as an illustration, for the given sample of detected candidates for
mutations, we performed the following. We sampled five reads from the
aforementioned list of 11 rows in Table 3 and aligned them to the human
genome for further investigation. While none of these five reads matched
previously reported mutations in the COSMIC database, one of the reads
aligned to a region of interest for oncogenomic laboratories. This read
aligned to the transcriptional regulator ARID1A, a SWI/SNF family member
that is frequently mutated in numerous types of cancers, including gastric,
ovarian, and pancreatic cancers [17], [18], [19], [20]. The probability of a
chance coincidence is less

than 10“® (assuming for simplicity an equipartition in the frequency counts. 
cf. Table 2), thus demonstrating the efficiency of the method for blind
application to datasets using the previously calibrated $\Phi$.

Table 3: The codes of candidates for mutations in a genome sequence with
high $\Phi$ at the given row numbers.

Line N    MC
1186009   TTGTGNAAGGG
4073568   CCACGNCCTGG
11648505  ATACANAACGC
21240249  AAGGANACTGA
21827969  CAATTNGGGAA
33372865  GCCGGNCGCGG
34549019  TGGCCNAGAAG
36995074  TGAAGNGTTCT
42622891  TTGTTNTTTTA
43978737  AGAAANATATT
50647272  TGACTNAAAGG
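The blind selection described in this section, i.e. keeping the rows with $\Phi > 0.7$ and with fewer than 3 unidentified nucleotides, can be sketched as a simple filter. This is an illustration only: `candidate_rows` is a hypothetical name, and the function that evaluates $\Phi$ for a row is passed in as a parameter because its details are not reproduced here. The toy scoring function in the usage lines is a stand-in and has nothing to do with the Kolmogorov function.

```python
def candidate_rows(rows, phi_of_row, phi_min=0.7, max_unknown=3):
    """Yield (line_number, row) for rows with phi > phi_min and N-count < max_unknown."""
    for line_no, row in enumerate(rows, start=1):
        if row.count("N") >= max_unknown:
            continue  # too many unidentified nucleotides
        if phi_of_row(row) > phi_min:
            yield line_no, row


def toy_score(row):
    # Stand-in for Phi, used only to make the example runnable:
    # the fraction of 'A' bases in the row.
    return row.count("A") / len(row)


rows = ["AAAA", "AANN", "ACGT", "NNNA", "AAAN"]
print(list(candidate_rows(rows, toy_score)))
```

Writing the filter as a generator allows a file of tens of millions of rows to be scanned line by line without loading it into memory.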

Another potential application of this technique involves the detection of
rare variants in sequencing data, where the parameter $\Phi$ may recognize
genomic sequences that are in extremely low abundance and warrant further
investigation by the researchers.


6 Conclusions 

The following basic conclusions can be drawn from the above analysis: 

(a) the stability of the descriptor, i.e. small standard errors and hence
a high and stable confidence level of the values of $\Phi$ for paired-end
sequence rows, both for normal rows, i.e. those without mutations, and for
those with mutations;

(b) the difference in the values of $\Phi$ for rows with and without mutations;

(c) the considered variations of string lengths still reflect the tendency;

(d) rows with certain mutations can be distinguished by means of the
used marker.

The presented results demonstrate for the greater cancer research
community the power of the Kolmogorov-Arnold technique for the identification
of mutations in paired-end genome sequencing data. In addition to the
significance of revealing the difference in the degree of randomness between
genome sequences with and without mutations, the method also has important
application potential. It may be applied to non-aligned sequencing segments,
which may significantly simplify the procedure of finding segments with
mutations and could speed up genomic research and its implementation in
clinical diagnostics.

Finally, the consideration of the inverse problem, namely the revealing of
a mutation associated with a tumour based purely on the computation of the
value of the marker when blinded to the data, indicates that the latter may
be used for the detection of rare or new types of mutations.